Abstract In this workflow, we will use R/Bioconductor packages to explore, process, visualise and understand mass spectrometry-based proteomics data, starting with raw data, and proceeding with identification and quantitation data, discussing some of their peculiarities compared to sequencing data along the way. The workflow is aimed at a beginner to intermediate level, such as, for example, seasoned R users who want to get started with mass spectrometry and proteomics, or proteomics practitioners who want to familiarise themselves with R and Bioconductor infrastructure.
This material available under a creative common CC-BY license. You are free to share (copy and redistribute the material in any medium or format) and adapt (remix, transform, and build upon the material) for any purpose, even commercially.
Before we start:
If you identify typos, if there are parts that you would like to see expended or clarified, please let me know by telling me directly (during workshops), opening a github issue or by emailing me directly. Please do also briefly specify your background/familiarity with mass spectrometry and/or proteomics (beginner, intermediate or expert) so that I can update accordingly.
In recent years, there we have seen an increase in the number of packages to analyse mass spectrometry and proteomics data for R and Bioconductor, as well as an increase in total number of downloads. See vignette Proteomics packages in Bioconductor for more details and code underlying these figures.
It is also good to highlight that several of these package have become a group efforts, supported by several developers in the community. This post illustrates the various contributions to MSnbase. mzR has benefited by a similar wide range of successful contributions. Both packages, and in particular mzR, are used by many others, and will be described in some detail in this workflow.
This workflow illustrates R / Bioconductor infrastructure for proteomics. Topics covered focus on support for open community-driven formats for raw data and identification results, packages for peptide-spectrum matching, data processing and analysis:
Links to other packages and references are also documented. In particular, the vignettes included in the RforProteomics package also contains relevant material.
This workflow provides a general introduction to Bioconductor software for mass spectrometry and proteomics. If you are interested in
vignette("pRoloc-tutorial", package = "pRoloc") or online.vignette("msnid_vignette", package = "MSnID") or online. In addition, the vignettes of the msmsTest package describe how to analyse spectral counting data using packages dedicated for the analysis of high throughput sequencing data.vignette("MALDIquant-intro", package = "MALDIquant") and available online.vignette("Cardinal-walkthrough", package = "Cardinal") and online.The follow packages will be used throughout this documents. R version 3.3.1 or higher is required to install all the packages using BiocInstaller::biocLite.
library("mzR")
library("mzID")
library("MSnID")
library("MSnbase")
library("rpx")
library("MLInterfaces")
library("pRoloc")
library("pRolocdata")
library("MSGFplus")
library("rols")
library("hpar")
The most convenient way to install all the tutorials requirement (and more related content), is to install RforProteomics with all its dependencies.
library("BiocInstaller")
biocLite("RforProteomics", dependencies = TRUE)
Other packages of interest, such as rTANDEM or MSGFgui will be described later in the document but are not required to execute the code in this workflow.
In Bioconductor version 3.6, there are respectively 92 proteomics, 62 mass spectrometry software packages and 17 mass spectrometry experiment packages. These respective packages can be extracted with the proteomicsPackages(), massSpectrometryPackages() and massSpectrometryDataPackages() and explored interactively, or looked at by exploring the respective biocViews on the Bioconductor web page.
library("RforProteomics")
pp <- proteomicsPackages()
display(pp)
Most community-driven formats described in the table are supported in R. We will see how to read and access these data in the following sections.
## Warning: package 'knitr' was built under R version 3.6.0
| Type | Format | Package |
|---|---|---|
| raw | mzML, mzXML, netCDF, mzData | MSnbase (read and write in version >= 2.3.13) via mzR |
| identification | mzIdentML | mzID (read) and MSnbase (read, via mzR) |
| quantitation | mzQuantML | |
| peak lists | mgf | MSnbase (read) |
| other | mzTab | MSnbase (read) |
Mass spectrometry (MS) is a technology that separates charged molecules (ions) based on their mass to charge ratio (M/Z). It is often coupled to chromatography (liquid LC, but can also be gas-based GC). The time an analytes takes to elute from the chromatography column is the retention time.
A chromatogram, illustrating the total amount of analytes over the retention time.
An mass spectrometer is composed of three components:
When using mass spectrometry for proteomics, the proteins are first digested with a protease such as trypsin. In mass shotgun proteomics, the analytes assayed in the mass spectrometer are peptides.
Often, ions are subjected to more than a single MS round. After a first round of separation, the peaks in the spectra, called MS1 spectra, represent peptides. At this stage, the only information we possess about these peptides are their retention time and their mass-to-charge (we can also infer their charge be inspecting their isotopic envelope, i.e the peaks of the individual isotopes, see below), which is not enough to infer their identify (i.e. their sequence).
In MSMS (or MS2), the settings of the mass spectrometer are set automatically to select a certain number of MS1 peaks (for example 20). Once a narrow M/Z range has been selected (corresponding to one high-intensity peak, a peptide, and some background noise), it is fragmented (using for example collision-induced dissociation (CID), higher energy collisional dissociation (HCD) or electron-transfer dissociation (ETD)). The fragment ions are then themselves separated in the analyser to produce a MS2 spectrum. The unique fragment ion pattern can then be used to infer the peptide sequence using de novo sequencing (when the spectrum is of high enough quality) of using a search engine such as, for example Mascot, MSGF+, …, that will match the observed, experimental spectrum to theoratical spectra (see details below).
Schematics of a mass spectrometer and two rounds of MS.
The animation below show how 25 ions different ions (i.e. having different M/Z values) are separated throughout the MS analysis and are eventually detected (i.e. quantified). The final frame shows the hypothetical spectrum.
Separation and detection of ions in a mass spectrometer.
The figures below illustrate the two rounds of MS. The spectrum on the left is an MS1 spectrum acquired after 21 minutes and 3 seconds of elution. 10 peaks, highlited by dotted vertical lines, were selected for MS2 analysis. The peak at M/Z 460.79 (488.8) is highlighted by a red (orange) vertical line on the MS1 spectrum and the fragment spectra are shown on the MS2 spectrum on the top (bottom) right figure.
Parent ions in the MS1 spectrum (left) and two sected fragment ions MS2 spectra (right).
The figures below represent the 3 dimensions of MS data: a set of spectra (M/Z and intensity) of retention time, as well as the interleaved nature of MS1 and MS2 (and there could be more levels) data.
MS1 spectra over retention time.
MS2 spectra interleaved between two MS1 spectra.
MS-based proteomics data is disseminated through the ProteomeXchange infrastructure, which centrally coordinates submission, storage and dissemination through multiple data repositories, such as the PRoteomics IDEntifications (PRIDE) database at the EBI for mass spectrometry-based experiments (including quantitative data, as opposed as the name suggests), PASSEL at the ISB for Selected Reaction Monitoring (SRM, i.e. targeted) data and the MassIVE resource. These data can be downloaded within R using the rpx package.
library("rpx")
pxannounced()
## 15 new ProteomeXchange annoucements
## Data.Set Publication.Data Message
## 1 PXD007122 2018-07-10 09:35:52 New
## 2 PXD010346 2018-07-10 04:09:06 New
## 3 PXD010345 2018-07-09 16:41:16 New
## 4 PXD010344 2018-07-09 16:41:06 New
## 5 PXD009495 2018-07-09 07:22:36 New
## 6 PXD009826 2018-07-06 14:39:06 New
## 7 PXD009939 2018-07-06 12:39:44 New
## 8 PXD007761 2018-07-06 12:38:09 Updated information
## 9 PXD005923 2018-07-06 10:00:35 Updated information
## 10 PXD008069 2018-07-06 07:27:24 New
## 11 PXD005973 2018-07-05 17:28:53 New
## 12 PXD008592 2018-07-05 17:25:40 New
## 13 PXD008242 2018-07-05 17:23:04 New
## 14 PXD007573 2018-07-05 13:57:46 New
## 15 PXD007601 2018-07-05 13:54:08 New
Using the unique PXD000001 identifier, we can retrieve the relevant metadata that will be stored in a PXDataset object. The names of the files available in this data can be retrieved with the pxfiles accessor function.
px <- PXDataset("PXD000001")
px
## Object of class "PXDataset"
## Id: PXD000001 with 12 files
## [1] 'F063721.dat' ... [12] 'generated'
## Use 'pxfiles(.)' to see all files.
pxfiles(px)
## [1] "F063721.dat"
## [2] "F063721.dat-mztab.txt"
## [3] "PRIDE_Exp_Complete_Ac_22134.xml.gz"
## [4] "PRIDE_Exp_mzData_Ac_22134.xml.gz"
## [5] "PXD000001_mztab.txt"
## [6] "README.txt"
## [7] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
## [8] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzXML"
## [9] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML"
## [10] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.raw"
## [11] "erwinia_carotovora.fasta"
## [12] "generated"
Other metadata for the px data set:
pxtax(px)
## [1] "Erwinia carotovora"
pxurl(px)
## [1] "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001"
pxref(px)
## [1] "Gatto L, Christoforou A. Using R and Bioconductor for proteomics data analysis. Biochim Biophys Acta. 2014 1844(1 pt a):42-51"
Data files can then be downloaded with the pxget function. Below, we retrieve the raw data file. The file is downloaded2 If the file is already available, it is not downloaded a second time. in the working directory and the name of the file is return by the function and stored in the mzf variable for later use.
fn <- "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
mzf <- pxget(px, fn)
## Downloading 1 file
## /home/lg390/Documents/Teaching/bioc-ms-prot/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML already present.
mzf
## [1] "/home/lg390/Documents/Teaching/bioc-ms-prot/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
AnnotationHub is a cloud resource set up and managed by the Bioconductor project that serves various omics datasets. It is possible to contribute and access (albeit currently only a limited number of) proteomics data.
library("AnnotationHub")
##
## Attaching package: 'AnnotationHub'
## The following object is masked from 'package:Biobase':
##
## cache
ah <- AnnotationHub()
## snapshotDate(): 2018-04-30
query(ah, "proteomics")
## AnnotationHub with 4 records
## # snapshotDate(): 2018-04-30
## # $dataprovider: PRIDE
## # $species: Erwinia carotovora
## # $rdataclass: AAStringSet, MSnSet, mzRident, mzRpwiz
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## # tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH49006"]]'
##
## title
## AH49006 | PXD000001: Erwinia carotovora and spiked-in protein fasta file
## AH49007 | PXD000001: Peptide-level quantitation data
## AH49008 | PXD000001: raw mass spectrometry data
## AH49009 | PXD000001: MS-GF+ identiciation data
ms <- ah[["AH49008"]]
ms
## Mass Spectrometry file handle.
## Filename: 55314
## Number of scans: 7534
The data contains 7534 spectra - 1431 MS1 spectra and 6103 MS2 spectra. The file name, 55314, is not very descriptive because the data originates from the AnnotationHub cloud repository. If the data was read from a local file, is would be named as the mzML (or mzXML) file (see below).
Some data are also distributed through dedicated packages. The msdata, for example, provides some general raw data files relevant for both proteomics and metabolomics.
library("msdata")
## proteomics raw data
proteomics()
## [1] "MRM-standmix-5.mzML.gz"
## [2] "MS3TMT10_01022016_32917-33481.mzML.gz"
## [3] "MS3TMT11.mzML"
## [4] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz"
## [5] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzML.gz"
## proteomics identification data
ident()
## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid"
More often, such experiment packages distribute processed data; an example of such is the pRolocdata package, that offers quantitative proteomics data.
The MSnbase package provides high-level data abstractions for raw MS data through the MSnExp class and containers for quantification data via the MSnSet class (see below). Both store
spectra (or the [, [[ operators) or exprs;data.frame with pData;data.frame with fData.Another useful slot is processingData, accessed with processingData(.), that records all the processing that objects have undergone since their creation (see examples below).
MSnExp classThe readMSData will parse the raw data and construct an MS experiment object of class MSnExp. An important argument to readMSData is the mode, which can be "onDisk" or "inMemory". The former doesn’t load the raw data in memory (which is not advised for MS1 data, or when many files are loaded) and is generally the recommended mode. See the benchmarking vignette[^Open it with vignette("benchmarking", package = "MSnbase") or read it online] for details).
library("MSnbase")
## get a small test data
rawFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
full.name = TRUE, pattern = "mzXML$")
basename(rawFile)
## [1] "dummyiTRAQ.mzXML"
msexp <- readMSData(rawFile, msLevel = 2L)
msexp
## MSn experiment data ("MSnExp")
## Object size in memory: 0.18 Mb
## - - - Spectra data - - -
## MS level(s): 2
## Number of spectra: 5
## MSn retention times: 25:1 - 25:2 minutes
## - - - Processing information - - -
## Data loaded: Tue Jul 10 15:18:02 2018
## MSnbase version: 2.6.1
## - - - Meta data - - -
## phenoData
## rowNames: dummyiTRAQ.mzXML
## varLabels: sampleNames
## varMetadata: labelDescription
## Loaded from:
## dummyiTRAQ.mzXML
## protocolData: none
## featureData
## featureNames: F1.S1 F1.S2 ... F1.S5 (5 total)
## fvarLabels: spectrum
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
Spectra can be extracted as a list of Spectrum2 objects with the spectra accessor or as a subset of the original MSnExp data with the [ operator. Individual spectra can be accessed with [[.
length(msexp)
## [1] 5
msexp[1:2]
## MSn experiment data ("MSnExp")
## Object size in memory: 0.07 Mb
## - - - Spectra data - - -
## MS level(s): 2
## Number of spectra: 2
## MSn retention times: 25:1 - 25:2 minutes
## - - - Processing information - - -
## Data loaded: Tue Jul 10 15:18:02 2018
## Data [numerically] subsetted 2 spectra: Tue Jul 10 15:18:02 2018
## MSnbase version: 2.6.1
## - - - Meta data - - -
## phenoData
## rowNames: dummyiTRAQ.mzXML
## varLabels: sampleNames
## varMetadata: labelDescription
## Loaded from:
## dummyiTRAQ.mzXML
## protocolData: none
## featureData
## featureNames: F1.S1 F1.S2
## fvarLabels: spectrum
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
msexp[[2]]
## Object of class "Spectrum2"
## Precursor: 546.9586
## Retention time: 25:2
## Charge: 3
## MSn level: 2
## Peaks count: 1012
## Total ion count: 56758067
The identification results stemming from the same raw data file can then be used to add PSM matches.
## initial feature variable
fData(msexp)
## spectrum
## F1.S1 1
## F1.S2 2
## F1.S3 3
## F1.S4 4
## F1.S5 5
## find path to a mzIdentML file
identFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
full.name = TRUE, pattern = "dummyiTRAQ.mzid")
basename(identFile)
## [1] "dummyiTRAQ.mzid"
msexp <- addIdentificationData(msexp, identFile)
## additional feature variables
fvarLabels(msexp)
## [1] "spectrum" "acquisition.number"
## [3] "sequence" "chargeState"
## [5] "rank" "passThreshold"
## [7] "experimentalMassToCharge" "calculatedMassToCharge"
## [9] "modNum" "isDecoy"
## [11] "post" "pre"
## [13] "start" "end"
## [15] "DatabaseAccess" "DBseqLength"
## [17] "DatabaseSeq" "DatabaseDescription"
## [19] "idFile" "MS.GF.RawScore"
## [21] "MS.GF.DeNovoScore" "MS.GF.SpecEValue"
## [23] "MS.GF.EValue" "modName"
## [25] "modMass" "modLocation"
## [27] "subOriginalResidue" "subReplacementResidue"
## [29] "subLocation" "nprot"
## [31] "npep.prot" "npsm.prot"
## [33] "npsm.pep"
We see that 3 out of 5 MS2 spectra in the msexp data have been identified; those that haven’t have missing values for the new, id-related feature variables.
fData(msexp)$rank
## [1] 1 1 NA NA 1
fData(msexp)$isDecoy
## [1] FALSE FALSE NA NA FALSE
Exercise Load all MS level data from file
MS3TMT11.mzMLavailable in themsdatapackage usingreadMSData, making sure you setmode = "onDisk", and verify which MS levels (accessible with themsLevelfunction) are centroided (accessible with thecentroided()function). See section Raw data processing for data in profile and centroided (processed) modes.
f <- proteomics(full.names = TRUE, pattern = "MS3TMT11.mzML")
ms <- readMSData(f, mode = "onDisk")
table(centroided(ms), msLevel(ms))
##
## 1 2 3
## FALSE 30 0 0
## TRUE 0 482 482
Spectra and (parts of) experiments can be extracted and plotted.
msexp[[1]]
## Object of class "Spectrum2"
## Precursor: 645.3741
## Retention time: 25:1
## Charge: 3
## MSn level: 2
## Peaks count: 2921
## Total ion count: 668170086
plot(msexp[[1]], full=TRUE)
Plotting a object of class Spectrum.
As this data was labeled with iTRAQ4 isobaric tags, we can highlight these four peaks of interest with
plot(msexp[[1]], full=TRUE, reporters = iTRAQ4)
Plotting a object of class Spectrum with reporter ions.
msexp[1:3]
## MSn experiment data ("MSnExp")
## Object size in memory: 0.11 Mb
## - - - Spectra data - - -
## MS level(s): 2
## Number of spectra: 3
## MSn retention times: 25:1 - 25:2 minutes
## - - - Processing information - - -
## Data loaded: Tue Jul 10 15:18:02 2018
## Data [numerically] subsetted 3 spectra: Tue Jul 10 15:18:05 2018
## MSnbase version: 2.6.1
## - - - Meta data - - -
## phenoData
## rowNames: dummyiTRAQ.mzXML
## varLabels: sampleNames
## varMetadata: labelDescription
## Loaded from:
## dummyiTRAQ.mzXML
## protocolData: none
## featureData
## featureNames: F1.S1 F1.S2 F1.S3
## fvarLabels: spectrum acquisition.number ... npsm.pep (33 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
plot(msexp[1:3], full=TRUE)
Plotting a object of class MSnExp
In the examples ablove, we only used a single file as input to readMSData, but multiple file can be read into a single MSnExp object. The origin of the spectra can be accessed with the fromFile function:
fromFile(msexp)
## F1.S1 F1.S2 F1.S3 F1.S4 F1.S5
## 1 1 1 1 1
Exercise Repeat the previous combination of raw and identification data with the TMT_Erwinia
TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gzandTMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzidfiles frommsdata. Retain only MS level data; this can be done either when reading the data in (see themsLevelargument in?readMSData) or can be done afterwards by filtering the MS levels withfilterMsLevel.
## read raw data
rwf <- proteomics(pattern = "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz", full.names = TRUE)
tmterw <- readMSData(rwf, mode = "onDisk")
## or, only read MS2-leve data
## tmterw <- readMSData(rwf, mode = "onDisk", msLevel = 2L)
## add identification data
idf <- ident(full.names = TRUE)
## Error: Character input expected
tmterw <- addIdentificationData(tmterw, idf)
## Error in addIdentificationData(tmterw, idf): object 'idf' not found
tmterw2 <- filterMsLevel(tmterw, 2L)
## It is also possible to chain operations
library("magrittr")
tmterw2 <- rwf %>%
readMSData(mode = "onDisk") %>%
addIdentificationData(idf) %>%
filterMsLevel(2L)
Exercise Still using the
TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210data from the previous exercise, identify the index of the MS2 spectrum with the highest precursor intensity (see theprecursorIntensityfeature variable) and plot it as illustrated above.
i <- which.max(precursorIntensity(tmterw2))
sp <- tmterw2[[i]]
plot(sp, full = TRUE)
This section illustrates the underlying infrastructure from the mzR package, that is used by MSnbase under the hood. It is recommended to use the high level interfaces, as it supports multiple files and does data integrity checks throughout the data processing.
The mzR package provides an interface to the proteowizard C/C++ code base to access various raw data files, such as mzML, mzXML, netCDF, and mzData. The data is accessed on-disk, i.e it is not loaded entirely in memory, and only when explicitly requested. The three main functions are openMSfile to create a file handle to a raw data file, header to extract metadata about the spectra contained in the file and peaks to extract one or multiple spectra of interest. Other functions such as instrumentInfo, or runInfo can be used to gather general information about a run.
Below, we access the raw data file downloaded in the previous section and open a file handle that will allow us to extract data and metadata of interest.
library("mzR")
basename(mzf)
## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
ms <- openMSfile(mzf)
ms
## Mass Spectrometry file handle.
## Filename: TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
## Number of scans: 7534
The object loaded from AnnotationHub in the previous section is of the same type, and was also created by the openMSfile function. All operations below can equally be applied to it.
The header function returns the metadata of all available peaks:
hd <- header(ms)
dim(hd)
## [1] 7534 25
names(hd)
## [1] "seqNum" "acquisitionNum"
## [3] "msLevel" "polarity"
## [5] "peaksCount" "totIonCurrent"
## [7] "retentionTime" "basePeakMZ"
## [9] "basePeakIntensity" "collisionEnergy"
## [11] "ionisationEnergy" "lowMZ"
## [13] "highMZ" "precursorScanNum"
## [15] "precursorMZ" "precursorCharge"
## [17] "precursorIntensity" "mergedScan"
## [19] "mergedResultScanNum" "mergedResultStartScanNum"
## [21] "mergedResultEndScanNum" "injectionTime"
## [23] "filterString" "spectrumId"
## [25] "centroided"
We can extract metadata and scan data for scan 1000 as follows:
hd[1000, ]
## seqNum acquisitionNum msLevel polarity peaksCount totIonCurrent
## 1000 1000 1000 2 1 274 1048554
## retentionTime basePeakMZ basePeakIntensity collisionEnergy
## 1000 1106.916 136.061 164464 45
## ionisationEnergy lowMZ highMZ precursorScanNum precursorMZ
## 1000 0 104.5467 1370.758 992 683.0817
## precursorCharge precursorIntensity mergedScan mergedResultScanNum
## 1000 2 689443.7 0 0
## mergedResultStartScanNum mergedResultEndScanNum injectionTime
## 1000 0 0 55.21463
## filterString
## 1000 FTMS + p NSI d Full ms2 683.08@hcd45.00 [100.00-1380.00]
## spectrumId centroided
## 1000 controllerType=0 controllerNumber=1 scan=1000 TRUE
head(peaks(ms, 1000))
## [,1] [,2]
## [1,] 104.5467 308.9326
## [2,] 104.5684 308.6961
## [3,] 108.8340 346.7183
## [4,] 109.3928 365.1236
## [5,] 110.0345 616.7905
## [6,] 110.0703 429.1975
plot(peaks(ms, 1000), type = "h", xlab = "M/Z", ylab = "Intensity")
Manual extraction and plotting of an MS spectrum
The RforProteomics package distributes a small identification result file that we load and parse using infrastructure from the mzR package.
idf <- msdata::ident(full.names = TRUE)
basename(idf)
## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid"
The easiest way to read identification data in mzIdentML (often abbreviated with mzid) into R is to read it with readMzIdData, that will parse it, process it, and return a data.frame:
iddf <- readMzIdData(idf)
head(iddf)
## sequence spectrumID
## 1 RQCRTDFLNYLR controllerType=0 controllerNumber=1 scan=2949
## 2 ESVALADQVTCVDWRNRKATKK controllerType=0 controllerNumber=1 scan=6534
## 3 KELLCLAMQIIR controllerType=0 controllerNumber=1 scan=5674
## chargeState rank passThreshold experimentalMassToCharge
## 1 3 1 TRUE 548.2856
## 2 2 1 TRUE 1288.1528
## 3 2 1 TRUE 744.4109
## calculatedMassToCharge modNum isDecoy post pre start end DatabaseAccess
## 1 547.9474 1 FALSE V R 574 585 ECA2006
## 2 1288.1741 1 FALSE G R 69 90 ECA1676
## 3 744.4255 1 TRUE Q R 131 142 XXX_ECA2855
## DBseqLength DatabaseSeq
## 1 1295
## 2 110
## 3 157
## DatabaseDescription
## 1 ECA2006 ATP-dependent helicase
## 2 ECA1676 putative growth inhibitory protein
## 3
## acquisitionNum
## 1 2949
## 2 6534
## 3 5674
## spectrumFile
## 1 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
## 2 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
## 3 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
## idFile
## 1 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid
## 2 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid
## 3 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid
## MS.GF.RawScore MS.GF.DeNovoScore MS.GF.SpecEValue MS.GF.EValue
## 1 10 101 4.617121e-08 0.1321981
## 2 12 121 7.255875e-08 0.2087481
## 3 8 74 9.341019e-08 0.2674533
## MS.GF.QValue MS.GF.PepQValue modName modMass modLocation
## 1 0.5254237 0.5490196 Carbamidomethyl 57.02146 3
## 2 0.6103896 0.6231884 Carbamidomethyl 57.02146 11
## 3 0.6250000 0.6363636 Carbamidomethyl 57.02146 5
## subOriginalResidue subReplacementResidue subLocation
## 1 <NA> <NA> NA
## 2 <NA> <NA> NA
## 3 <NA> <NA> NA
## [ reached getOption("max.print") -- omitted 3 rows ]
When adding identification data with the addIdentificationData function as shown above, the data is read with readMzIdData, as shown above, and is cleaned up:
This behaviour can be replicates with the filterIdentificationDataFrame function.
Along the lines of what is available for raw data, the parsing of this XML-based format comes from mzR. A file handle to mzIdentML files can be created with the openIDfile function. As for raw data, the underlying C/C++ code comes from the proteowizard.
library("mzR")
id1 <- openIDfile(idf)
id1
## Identification file handle.
## Filename: TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid
## Number of psms: 5759
Various data can be extracted from the identification object. The peptide spectrum matches (PSMs) and the identification scores can be accessed as a data.frame with psms and score respectively.
softwareInfo(id1)
## [1] "MS-GF+ Beta (v10072) "
## [2] "ProteoWizard MzIdentML 3.0.501 ProteoWizard"
enzymes(id1)
## name nTermGain cTermGain minDistance missedCleavages
## 1 Trypsin 0 1000
fid1 <- mzR::psms(id1)
head(fid1)
## spectrumID chargeState rank
## 1 controllerType=0 controllerNumber=1 scan=5782 3 1
## 2 controllerType=0 controllerNumber=1 scan=6037 3 1
## 3 controllerType=0 controllerNumber=1 scan=5235 3 1
## 4 controllerType=0 controllerNumber=1 scan=5397 3 1
## 5 controllerType=0 controllerNumber=1 scan=6075 3 1
## passThreshold experimentalMassToCharge calculatedMassToCharge
## 1 TRUE 1080.2325 1080.2321
## 2 TRUE 1002.2089 1002.2115
## 3 TRUE 1189.2836 1189.2800
## 4 TRUE 960.5365 960.5365
## 5 TRUE 1264.3409 1264.3419
## sequence modNum isDecoy post pre start end
## 1 PVQIQAGEDSNVIGALGGAVLGGFLGNTIGGGSGR 0 FALSE S R 50 84
## 2 TQVLDGLINANDIEVPVALIDGEIDVLR 0 FALSE R K 288 315
## 3 TKGLNVMQNLLTAHPDVQAVFAQNDEMALGALR 0 FALSE A R 192 224
## 4 SQILQQAGTSVLSQANQVPQTVLSLLR 0 FALSE - R 264 290
## 5 PIIGDNPFVVVLPDVVLDESTADQTQENLALLISR 0 FALSE F R 119 153
## DatabaseAccess DBseqLength DatabaseSeq
## 1 ECA1932 155
## 2 ECA1147 434
## 3 ECA0013 295
## 4 ECA1731 290
## 5 ECA1443 298
## DatabaseDescription acquisitionNum
## 1 ECA1932 outer membrane lipoprotein 5782
## 2 ECA1147 trigger factor 6037
## 3 ECA0013 ribose-binding periplasmic protein 5235
## 4 ECA1731 flagellin 5397
## 5 ECA1443 UTP--glucose-1-phosphate uridylyltransferase 6075
## [ reached getOption("max.print") -- omitted 1 row ]
sc1 <- mzR::score(id1)
head(sc1)
## spectrumID MS.GF.RawScore
## 1 controllerType=0 controllerNumber=1 scan=5782 147
## 2 controllerType=0 controllerNumber=1 scan=6037 214
## 3 controllerType=0 controllerNumber=1 scan=5235 211
## 4 controllerType=0 controllerNumber=1 scan=5397 154
## 5 controllerType=0 controllerNumber=1 scan=6075 188
## 6 controllerType=0 controllerNumber=1 scan=5761 123
## MS.GF.DeNovoScore MS.GF.SpecEValue MS.GF.EValue MS.GF.QValue
## 1 174 3.764831e-27 1.086033e-20 0
## 2 245 6.902626e-26 1.988774e-19 0
## 3 264 1.778789e-25 5.129649e-19 0
## 4 178 1.792541e-24 5.163566e-18 0
## 5 252 1.510364e-23 4.356914e-17 0
## 6 138 1.618941e-23 4.658952e-17 0
## MS.GF.PepQValue
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
The mzID package, has similar functionality to parse identification files, and was the first one to provide such capabilities in R. The main difference with mzR is that is parses the files using the XMLpackage and reads the whole data into memory rather than relying on proteowizard, and is slower.
Exercise Data wrangling with identification data; the standard tidyverse tools are fit for purpose here. Extract and combine the PSMs and their scores as described above and combine them. From the available data, calculate the length of each peptide (you can use
ncharwith the peptide sequencesequence) and the number of peptides for each protein (DatabaseDescription). Plot the length of the proteins for their respective number of peptides. Optionally, stratify the plot by the peptide e-value score (MS.GF.EValue) using for examplecutto define bins.
suppressPackageStartupMessages(library("dplyr"))
## Warning: package 'dplyr' was built under R version 3.6.0
iddf <- as_tibble(iddf) %>%
mutate(peplen = nchar(as.character(sequence)))
## Warning: package 'bindrcpp' was built under R version 3.6.0
npeps <- iddf %>%
group_by(DatabaseDescription) %>%
tally
iddf <- full_join(iddf, npeps)
## Joining, by = "DatabaseDescription"
library("ggplot2")
## Warning: package 'ggplot2' was built under R version 3.6.0
ggplot(ids, aes(x = n, y = DBseqLength)) + geom_point()
## Error in ggplot(ids, aes(x = n, y = DBseqLength)): object 'ids' not found
ids$evalBins <- cut(ids$MS.GF.EValue, summary(ids$MS.GF.EValue))
## Error in cut(ids$MS.GF.EValue, summary(ids$MS.GF.EValue)): object 'ids' not found
ggplot(ids, aes(x = n, y = DBseqLength, color = peplen)) +
geom_point() +
facet_wrap(~ evalBins)
## Error in ggplot(ids, aes(x = n, y = DBseqLength, color = peplen)): object 'ids' not found
While searches are generally performed using third-party software independently of R or can be started from R using a system call, the MSGFplus package enables to perform a search using the MSGF+ engine, as illustrated below.
We search the TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML file against the fasta file from PXD000001 using MSGFplus.
We first download the fasta files from ProteomeXchange
fas <- pxget(px, "erwinia_carotovora.fasta")
## Downloading 1 file
basename(fas)
## [1] "erwinia_carotovora.fasta"
Below, we setup and run the search3 In the runMSGF call, the memory allocated to the java virtual machine is limited to 1GB. In general, there is no need to specify this argument, unless you experience an error regarding the maximum heap size..
library("MSGFplus")
msgfpar <- msgfPar(database = fas,
instrument = 'HighRes',
tda = TRUE,
enzyme = 'Trypsin',
protocol = 'iTRAQ')
idres <- runMSGF(msgfpar, mzf, memory=1000)
## '/usr/bin/java' -Xmx1000M -jar '/home/lg390/R/x86_64-pc-linux-gnu-library/3.4/MSGFplus/MSGFPlus/MSGFPlus.jar' -s '/home/lg390/Documents/Teaching/bioc-ms-prot/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML' -o '/home/lg390/Documents/Teaching/bioc-ms-prot/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid' -d '/home/lg390/Documents/Teaching/bioc-ms-prot/erwinia_carotovora.fasta' -tda 1 -inst 1 -e 1 -protocol 2
##
## reading TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid... DONE!
idres
## An mzID object
##
## Software used: MS-GF+ (version: Beta (v10072))
##
## Rawfile: /home/lg390/Documents/Teaching/bioc-ms-prot/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
##
## Database: /home/lg390/Documents/Teaching/bioc-ms-prot/erwinia_carotovora.fasta
##
## Number of scans: 5343
## Number of PSM's: 5656
## identification file (needed below)
basename(mzID::files(idres)$id)
## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid"
A graphical interface to perform the search the data and explore the results is also available:
library("MSGFgui")
MSGFgui()
The MSGFgui interface
The rTANDEM package can be used to perform a search with XTandem software.
The MSnID package can be used for post-search filtering of MS/MS identifications. One starts with the construction of an MSnID object that is populated with identification results that can be imported from a data.frame or from mzIdenML files. Here, we will use the example identification data provided with the package.
mzids <- system.file("extdata", "c_elegans.mzid.gz", package="MSnID")
basename(mzids)
## [1] "c_elegans.mzid.gz"
We start by loading the package, initialising the MSnID object, and add the identification result from our mzid file (there could of course be more that one).
library("MSnID")
msnid <- MSnID(".")
## Note, the anticipated/suggested columns in the
## peptide-to-spectrum matching results are:
## -----------------------------------------------
## accession
## calculatedMassToCharge
## chargeState
## experimentalMassToCharge
## isDecoy
## peptide
## spectrumFile
## spectrumID
msnid <- read_mzIDs(msnid, mzids)
## Loaded cached data
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 1
## #PSMs: 12263 at 36 % FDR
## #peptides: 9489 at 44 % FDR
## #accessions: 7414 at 76 % FDR
Printing the MSnID object returns some basic information such as
The package then enables to define, optimise and apply filtering based for example on missed cleavages, identification scores, precursor mass errors, etc. and assess PSM, peptide and protein FDR levels. To properly function, it expects to have access to the following data
## [1] "accession" "calculatedMassToCharge"
## [3] "chargeState" "experimentalMassToCharge"
## [5] "isDecoy" "peptide"
## [7] "spectrumFile" "spectrumID"
which are indeed present in our data:
names(msnid)
## [1] "spectrumID" "scan number(s)"
## [3] "acquisitionNum" "passThreshold"
## [5] "rank" "calculatedMassToCharge"
## [7] "experimentalMassToCharge" "chargeState"
## [9] "MS-GF:DeNovoScore" "MS-GF:EValue"
## [11] "MS-GF:PepQValue" "MS-GF:QValue"
## [13] "MS-GF:RawScore" "MS-GF:SpecEValue"
## [15] "AssumedDissociationMethod" "IsotopeError"
## [17] "isDecoy" "post"
## [19] "pre" "end"
## [21] "start" "accession"
## [23] "length" "description"
## [25] "pepSeq" "modified"
## [27] "modification" "idFile"
## [29] "spectrumFile" "databaseFile"
## [31] "peptide"
Here, we summarise a few steps and redirect the reader to the package’s vignette for more details:
Cleaning irregular cleavages at the termini of the peptides and missing cleavage site within the peptide sequences. The following two function call create the new numMisCleavages and numIrrCleabages columns in the MSnID object
msnid <- assess_termini(msnid, validCleavagePattern="[KR]\\.[^P]")
msnid <- assess_missed_cleavages(msnid, missedCleavagePattern="[KR](?=[^P$])")
Now, we can use the apply_filter function to effectively apply filters. The strings passed to the function represent expressions that will be evaluated, this keeping only PSMs that have 0 irregular cleavages and 2 or less missed cleavages.
msnid <- apply_filter(msnid, "numIrregCleavages == 0")
msnid <- apply_filter(msnid, "numMissCleavages <= 2")
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 1
## #PSMs: 7838 at 17 % FDR
## #peptides: 5598 at 23 % FDR
## #accessions: 3759 at 53 % FDR
Using "calculatedMassToCharge" and "experimentalMassToCharge", the mass_measurement_error function calculates the parent ion mass measurement error in parts per million.
summary(mass_measurement_error(msnid))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2184.0640 -0.6992 0.0000 17.6146 0.7512 2012.5178
We then filter any matches that do not fit the +/- 20 ppm tolerance
msnid <- apply_filter(msnid, "abs(mass_measurement_error(msnid)) < 20")
summary(mass_measurement_error(msnid))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -19.7797 -0.5866 0.0000 -0.2970 0.5713 19.6758
Filtering of the identification data will rely on
msnid$msmsScore <- -log10(msnid$`MS-GF:SpecEValue`)
msnid$absParentMassErrorPPM <- abs(mass_measurement_error(msnid))
MS2 filters are handled by a special MSnIDFilter class objects, where individual filters are set by name (that is present in names(msnid)) and comparison operator (>, <, = , …) defining if we should retain hits with higher or lower given the threshold and finally the threshold value itself.
filtObj <- MSnIDFilter(msnid)
filtObj$absParentMassErrorPPM <- list(comparison="<", threshold=10.0)
filtObj$msmsScore <- list(comparison=">", threshold=10.0)
show(filtObj)
## MSnIDFilter object
## (absParentMassErrorPPM < 10) & (msmsScore > 10)
We can then evaluate the filter on the identification data object, which return the false discovery rate and number of retained identifications for the filtering criteria at hand.
evaluate_filter(msnid, filtObj)
## fdr n
## PSM 0 3807
## peptide 0 2455
## accession 0 1009
Rather than setting filtering values by hand, as shown above, these can be set automativally to meet a specific false discovery rate.
filtObj.grid <- optimize_filter(filtObj, msnid, fdr.max=0.01,
method="Grid", level="peptide",
n.iter=500)
show(filtObj.grid)
## MSnIDFilter object
## (absParentMassErrorPPM < 3) & (msmsScore > 7.4)
evaluate_filter(msnid, filtObj.grid)
## fdr n
## PSM 0.004097561 5146
## peptide 0.006447651 3278
## accession 0.021996616 1208
Filters can eventually be applied (rather than just evaluated) using the apply_filter function.
msnid <- apply_filter(msnid, filtObj.grid)
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 1
## #PSMs: 5146 at 0.41 % FDR
## #peptides: 3278 at 0.64 % FDR
## #accessions: 1208 at 2.2 % FDR
And finally, identifications that matched decoy and contaminant protein sequences are removed
msnid <- apply_filter(msnid, "isDecoy == FALSE")
msnid <- apply_filter(msnid, "!grepl('Contaminant',accession)")
show(msnid)
## MSnID object
## Working directory: "."
## #Spectrum Files: 1
## #PSMs: 5117 at 0 % FDR
## #peptides: 3251 at 0 % FDR
## #accessions: 1179 at 0 % FDR
The resulting filtered identification data can be exported to a data.frame or to a dedicated MSnSet data structure for quantitative MS data, described below, and further processed and analyses using appropriate statistical tests.
The importance of flexible access to specialised data becomes visible in the figure below4 Figure and code taken from the RforProteomics visualisation vignette. Not only can we access specific data and understand/visualise them, but we can transverse all the data and extracted/visualise/understand structured slices of data.
Visual summary of a full round of MS1 and MS2 acquisition for specific MS1 spectrum.
In this code chunks we start by selecting relevant spectra of interest. We will focus on the first MS1 spectrum acquired after 30 minutes of retention time.
## (1) Open raw data file
ms <- openMSfile(mzf)
## (2) Extract the header information
hd <- header(ms)
## (3) MS1 spectra indices
ms1 <- which(hd$msLevel == 1)
## (4) Select MS1 spectra with retention time between 30 and 35 minutes
rtsel <- hd$retentionTime[ms1] / 60 > 30 & hd$retentionTime[ms1] / 60 < 35
## (5) Indices of the 1st and 2nd MS1 spectra after 30 minutes
i <- ms1[which(rtsel)][1]
j <- ms1[which(rtsel)][2]
## (6) Interleaved MS2 spectra
ms2 <- (i+1):(j-1)
Now now extract and plot all relevant information:
chromatogram function.plot(chromatogram(ms)[[1]], type = "l")
A chromatogram.
plot(chromatogram(ms)[[1]], type = "l")
abline(v = hd[i, "retentionTime"]/60, col = "red")
A chromatogram with highlighted retention time of interest.
plot(peaks(ms, i), type = "l", xlim = c(400, 1000))
legend("topright", bty = "n",
legend = paste0(
"Acquisition ", hd[i, "acquisitionNum"], "\n",
"Retention time ", formatRt(hd[i, "retentionTime"])))
The MS1 spectrum at the retention time of interest.
plot(peaks(ms, i), type = "l", xlim = c(400, 1000))
legend("topright", bty = "n",
legend = paste0(
"Acquisition ", hd[i, "acquisitionNum"], "\n",
"Retention time ", formatRt(hd[i, "retentionTime"])))
abline(v = hd[ms2, "precursorMZ"],
col = c("#FF000080",
rep("#12121280", 9)))
The MS1 spectrum at the retention time of interest, highlighting the peptides selected for MS2.
plot(peaks(ms, i), type = "l", xlim = c(521, 522.5))
abline(v = hd[ms2, "precursorMZ"], col = "#FF000080")
Isotopic envelope of a peptide.
length(ms2) MS2 spectra highlighted above.par(mfrow = c(5, 2), mar = c(2, 2, 0, 1))
for (ii in ms2) {
p <- peaks(ms, ii)
plot(p, xlab = "", ylab = "", type = "h", cex.axis = .6)
legend("topright", legend = paste0("Prec M/Z\n",
round(hd[ii, "precursorMZ"], 2)),
bty = "n", cex = .8)
}
MS2 spectra of all selected precursors from the previous MS1 spectrum.
Putting it all together:
layout(lout)
par(mar=c(4,2,1,1))
plot(chromatogram(ms)[[1]], type = "l")
abline(v = hd[i, "retentionTime"]/60, col = "red")
par(mar = c(3, 2, 1, 0))
plot(peaks(ms, i), type = "l", xlim = c(400, 1000))
legend("topright", bty = "n",
legend = paste0(
"Acquisition ", hd[i, "acquisitionNum"], "\n",
"Retention time ", formatRt(hd[i, "retentionTime"])))
abline(h = 0)
abline(v = hd[ms2, "precursorMZ"],
col = c("#FF000080",
rep("#12121280", 9)))
par(mar = c(3, 0.5, 1, 1))
plot(peaks(ms, i), type = "l", xlim = c(521, 522.5),
yaxt = "n")
abline(h = 0)
abline(v = hd[ms2, "precursorMZ"], col = "#FF000080")
par(mar = c(2, 2, 0, 1))
for (ii in ms2) {
p <- peaks(ms, ii)
plot(p, xlab = "", ylab = "", type = "h", cex.axis = .6)
legend("topright", legend = paste0("Prec M/Z\n",
round(hd[ii, "precursorMZ"], 2)),
bty = "n", cex = .8)
}
Visual summary of a full round of MS1 and MS2 acquisition for specific MS1 spectrum.
Below, we illustrate some additional visualisation and animations of raw MS data, also taken from the RforProteomics visualisation vignette. On the left, we have a heatmap visualisation of a MS map and a 3 dimensional representation of the same data. On the right, 2 MS1 spectra in blue and the set of interleaves 10 MS2 spectra.
## (1) MS space heaptmap
M <- MSmap(ms, ms1[rtsel], 521, 523, .005, hd)
## 1
ff <- colorRampPalette(c("yellow", "steelblue"))
trellis.par.set(regions=list(col=ff(100)))
m1 <- plot(M, aspect = 1, allTicks = FALSE)
## (2) Same data as (1), in 3 dimenstion
M@map[msMap(M) == 0] <- NA
m2 <- plot3D(M, rgl = FALSE)
## (3) The 2 MS1 and 10 interleaved MS2 spectra from above
i <- ms1[which(rtsel)][1]
j <- ms1[which(rtsel)][2]
M2 <- MSmap(ms, i:j, 100, 1000, 1, hd)
## 1
m3 <- plot3D(M2)
grid.arrange(m1, m2, m3, ncol = 3)
Plotting MS maps along retention time, MZ range and intensity.
Below, we have animations build from extracting successive slices as above.
Annotated spectra and comparing spectra.
par(mfrow = c(1, 2))
itraqdata2 <- pickPeaks(itraqdata, verbose = FALSE) ## centroiding
s <- "SIGFEGDSIGR"
plot(itraqdata2[[14]], s, main = s)
plot(itraqdata2[[25]], itraqdata2[[28]], sequences = rep("IMIDLDGTENK", 2))
Annotating and comparing MS2 spectra.
The annotation of spectra is obtained by simulating fragmentation of a peptide and matching observed peaks to fragments:
calculateFragments("SIGFEGDSIGR")
## mz ion type pos z seq
## 1 88.03931 b1 b 1 1 S
## 2 201.12337 b2 b 2 1 SI
## 3 258.14483 b3 b 3 1 SIG
## 4 405.21324 b4 b 4 1 SIGF
## 5 534.25583 b5 b 5 1 SIGFE
## 6 591.27729 b6 b 6 1 SIGFEG
## 7 706.30423 b7 b 7 1 SIGFEGD
## 8 793.33626 b8 b 8 1 SIGFEGDS
## 9 906.42032 b9 b 9 1 SIGFEGDSI
## 10 963.44178 b10 b 10 1 SIGFEGDSIG
## 11 175.11895 y1 y 1 1 R
## 12 232.14041 y2 y 2 1 GR
## 13 345.22447 y3 y 3 1 IGR
## 14 432.25650 y4 y 4 1 SIGR
## 15 547.28344 y5 y 5 1 DSIGR
## 16 604.30490 y6 y 6 1 GDSIGR
## [ reached getOption("max.print") -- omitted 16 rows ]
Visualising a pair of spectra means that we can access them, and that, in addition to plotting, we can manipulate them and perform computations. The two spectra corresponding to the IMIDLDGTENK peptide, for example have 22 common peaks, a correlation of 0.198 and a dot product of 0.21 (see ?compareSpectra for details).
There are a wide range of proteomics quantitation techniques that can broadly be classified as labelled vs. label-free, depending whether the features are labelled prior the MS acquisition and the MS level at which quantitation is inferred, namely MS1 or MS2.
| Label-free | Labelled | |
|---|---|---|
| MS1 | XIC | SILAC, 15N |
| MS2 | Counting | iTRAQ, TMT |
In terms of raw data quantitation, most efforts have been devoted to MS2-level quantitation. Label-free XIC quantitation has however been addressed in the frame of metabolomics data processing by the xcms infrastructure.
Below is a list of suggested packages for some common proteomics quantitation technologies:
An MSnExp is converted to an MSnSet by the quantitation method. Below, we use the iTRAQ 4-plex isobaric tagging strategy (defined by the iTRAQ4 parameter; other tags are available: see ?ReporterIons) and the max method to calculate the use the maximum of the reporter peak for quantitation.
plot(msexp[[1]], full=TRUE, reporters = iTRAQ4)
MS2 spectrum and it’s iTRAQ4 reporter ions.
msset <- quantify(msexp, method = "max", reporters = iTRAQ4)
The figure below give a schematics of an MSnSet instance and the relation between the assay data and the respective feature and sample metadata, accessible respectively with the exprs, fData and pData functions.
MSnSet structure
New columns can be added to the metadata slots.
exprs(msset)
## iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
## F1.S1 706555.7 685055.1 929016.1 668245.2
## F1.S2 260663.7 212745.0 163782.8 239142.7
## F1.S3 2213566.0 2069209.6 2204032.2 2331846.8
## F1.S4 616043.4 705976.6 671828.8 666845.6
## F1.S5 1736128.2 1787622.5 1795311.8 1825523.0
pData(msset)
## mz reporters
## iTRAQ4.114 114.1112 iTRAQ4
## iTRAQ4.115 115.1083 iTRAQ4
## iTRAQ4.116 116.1116 iTRAQ4
## iTRAQ4.117 117.1150 iTRAQ4
pData(msset)$groups <- rep(c("Treat", "Cond"), each = 2)
pData(msset)
## mz reporters groups
## iTRAQ4.114 114.1112 iTRAQ4 Treat
## iTRAQ4.115 115.1083 iTRAQ4 Treat
## iTRAQ4.116 116.1116 iTRAQ4 Cond
## iTRAQ4.117 117.1150 iTRAQ4 Cond
Another useful slot is processingData, accessed with processingData(.), that records all the processing that objects have undergone since their creation.
processingData(msset)
## - - - Processing information - - -
## Data loaded: Tue Jul 10 15:18:02 2018
## iTRAQ4 quantification by max: Tue Jul 10 15:18:30 2018
## MSnbase version: 2.6.1
See also The isobar package supports quantitation from centroided mgf peak lists or its own tab-separated files that can be generated from Mascot and Phenyx vendor files.
Other MS2 quantitation methods available in quantify include the (normalised) spectral index SI and (normalised) spectral abundance factor SAF or simply a simple count method5 The code below is for illustration only - it doesn’t make much sense to perform any of these quantitations on such a multiplexed data.
exprs(si <- quantify(msexp, method = "SIn"))
## dummyiTRAQ.mzXML
## ECA0510 0.0006553518
## ECA0984 0.0035384487
## ECA1028 0.0002684726
exprs(saf <- quantify(msexp, method = "NSAF"))
## dummyiTRAQ.mzXML
## ECA0510 0.4306167
## ECA0984 0.3094475
## ECA1028 0.2599359
Note that spectra that have not been assigned any peptide (NA) or that match non-unique peptides (npsm > 1) are discarded in the counting process.
As shown above, the MSnID package enables to explore and assess the confidence of identification data using mzid files. A subset of all peptide-spectrum matches, that pass a specific false discovery rate threshold can them be converted to an MSnSet, where the number of peptide occurrences are used to populate the assay data.
The Proteomics Standard Initiative (PSI) mzTab file format is aimed at providing a simpler (than XML formats) and more accessible file format to the wider community. It is composed of a key-value metadata section and peptide/protein/small molecule tabular sections. These data can be imported with the readMzTabData function6 We specify version 0.9 (which generates the warning) to fit with the version of that file. For recent files, the version argument should be ignored to use the importer for the current file version 1.0..
mztf <- pxget(px, "F063721.dat-mztab.txt")
## Downloading 1 file
(mzt <- readMzTabData(mztf, what = "PEP", version = "0.9"))
## Warning: Version 0.9 is deprecated. Please see '?readMzTabData' and '?
## MzTab' for details.
## MSnSet (storageMode: lockedEnvironment)
## assayData: 1528 features, 6 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: sub[1] sub[2] ... sub[6] (6 total)
## varLabels: abundance
## varMetadata: labelDescription
## featureData
## featureNames: 1 2 ... 1528 (1528 total)
## fvarLabels: sequence accession ... uri (14 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## mzTab read: Mon Jun 19 23:10:32 2017
## MSnbase version: 2.3.6
It is also possible to import arbitrary spreadsheets as MSnSet objects into R with the readMSnSet2 function. The main 2 arguments of the function are (1) a text-based spreadsheet and (2) column names of indices that identify the quantitation data. The latter can be queried with the getEcols function.
csv <- dir(system.file ("extdata" , package = "pRolocdata"),
full.names = TRUE, pattern = "pr800866n_si_004-rep1.csv")
getEcols(csv, split = ",")
## [1] "\"Protein ID\"" "\"FBgn\""
## [3] "\"Flybase Symbol\"" "\"No. peptide IDs\""
## [5] "\"Mascot score\"" "\"No. peptides quantified\""
## [7] "\"area 114\"" "\"area 115\""
## [9] "\"area 116\"" "\"area 117\""
## [11] "\"PLS-DA classification\"" "\"Peptide sequence\""
## [13] "\"Precursor ion mass\"" "\"Precursor ion charge\""
## [15] "\"pd.2013\"" "\"pd.markers\""
ecols <- 7:10
res <- readMSnSet2(csv, ecols)
head(exprs(res))
## area.114 area.115 area.116 area.117
## 1 0.379000 0.281000 0.225000 0.114000
## 2 0.420000 0.209667 0.206111 0.163889
## 3 0.187333 0.167333 0.169667 0.476000
## 4 0.247500 0.253000 0.320000 0.179000
## 5 0.216000 0.183000 0.342000 0.259000
## 6 0.072000 0.212333 0.573000 0.142667
head(fData(res))
## Protein.ID FBgn Flybase.Symbol No..peptide.IDs Mascot.score
## 1 CG10060 FBgn0001104 G-ialpha65A 3 179.86
## 2 CG10067 FBgn0000044 Act57B 5 222.40
## 3 CG10077 FBgn0035720 CG10077 5 219.65
## 4 CG10079 FBgn0003731 Egfr 2 86.39
## 5 CG10106 FBgn0029506 Tsp42Ee 1 52.10
## 6 CG10130 FBgn0010638 Sec61beta 2 79.90
## No..peptides.quantified PLS.DA.classification Peptide.sequence
## 1 1 PM
## 2 9 PM
## 3 3
## 4 2 PM
## 5 1 GGVFDTIQK
## 6 3 ER/Golgi
## Precursor.ion.mass Precursor.ion.charge pd.2013 pd.markers
## 1 PM unknown
## 2 PM unknown
## 3 unknown unknown
## 4 PM unknown
## 5 626.887 2 Phenotype 1 unknown
## 6 ER/Golgi ER
For raw data processing look at MSnbase’s clean, smooth, pickPeaks, removePeaks and trimMz for MSnExp and spectra processing methods.
As an illustration, we show the pickPeaks function on the itraqdata data. Centoiding transforms the distribution of M/Z values measured for an ion (i.e. a set of M/Z and intensities, first figure below) into a single M/Z and intensity pair of values (second figure below).
library("ggplot2") ## for coord_cartesian
data(itraqdata)
plot(itraqdata[[10]], full = TRUE) +
coord_cartesian(xlim = c(915, 925))
Peak picking: profile mode.
itraqdata2 <- pickPeaks(itraqdata)
plot(itraqdata2[[10]], full = TRUE, w1 = 0.05) +
coord_cartesian(xlim = c(915, 925))
Peak picking: centroided.
The MALDIquant and xcms packages also features a wide range of raw data processing methods on their own ad hoc data instance types.
Each different types of quantitative data will require their own pre-processing and normalisation steps. Both isobar and MSnbase allow to correct for isobaric tag impurities normalise the quantitative data.
data(itraqdata)
qnt <- quantify(itraqdata, method = "trap", reporters = iTRAQ4)
impurities <- matrix(c(0.929, 0.059, 0.002, 0.000,
0.020, 0.923, 0.056, 0.001,
0.000, 0.030, 0.924, 0.045,
0.000, 0.001, 0.040, 0.923),
nrow = 4, byrow = TRUE)
## or, using makeImpuritiesMatrix()
## impurities <- makeImpuritiesMatrix(4)
qnt <- purityCorrect(qnt, impurities)
processingData(qnt)
## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011
## Updated from version 0.3.0 to 0.3.1 [Fri Jul 8 20:23:25 2016]
## iTRAQ4 quantification by trapezoidation: Tue Jul 10 15:18:33 2018
## Purity corrected: Tue Jul 10 15:18:33 2018
## MSnbase version: 1.1.22
Various normalisation methods can be applied the MSnSet instances using the normalise method: variance stabilisation (vsn), quantile (quantiles), median or mean centring (center.media or center.mean), …
qnt <- normalise(qnt, "quantiles")
processingData(qnt)
## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011
## Updated from version 0.3.0 to 0.3.1 [Fri Jul 8 20:23:25 2016]
## iTRAQ4 quantification by trapezoidation: Tue Jul 10 15:18:33 2018
## Purity corrected: Tue Jul 10 15:18:33 2018
## Normalised (quantiles): Tue Jul 10 15:18:33 2018
## MSnbase version: 1.1.22
The combineFeatures method combines spectra/peptides quantitation values into protein data. The grouping is defined by the groupBy parameter, which is generally taken from the feature metadata (protein accessions, for example).
gb <- fData(qnt)$ProteinDescription
prt <- combineFeatures(qnt, groupBy = gb, fun = "median")
## Your data contains missing values. Please read the relevant
## section in the combineFeatures manual page for details the effects
## of missing values on data aggregation.
processingData(prt)
## - - - Processing information - - -
## Data loaded: Wed May 11 18:54:39 2011
## Updated from version 0.3.0 to 0.3.1 [Fri Jul 8 20:23:25 2016]
## iTRAQ4 quantification by trapezoidation: Tue Jul 10 15:18:33 2018
## Purity corrected: Tue Jul 10 15:18:33 2018
## Normalised (quantiles): Tue Jul 10 15:18:33 2018
## Combined 55 features into 39 using median: Tue Jul 10 15:18:33 2018
## MSnbase version: 2.6.1
Finally, proteomics data analysis is generally hampered by missing values. Missing data imputation is a sensitive operation whose success will be guided by many factors, such as degree and (non-)random nature of the missingness.
Below, we load an MSnSet with missing values, count the number missing and non-missing values.
data(naset)
table(is.na(naset))
##
## FALSE TRUE
## 10254 770
The naplot figure will reorder cells within the data matrix so that the experiments and features with many missing values will be grouped towards the top and right of the heatmap, and barplots at the top and right summarise the number of missing values in the respective samples (column) and rows (rows).
naplot(naset)
Overview of missing values.
The importance of missing values in a dataset will depend on the quantitation technology employed. Label-free quantitation in particular can suffer from a very high number of missing values.
Missing value in MSnSet instances can be filtered out with the filterNA functions. By default, it removes features that contain at least NA value.
## remove features with missing values
tmp <- filterNA(naset)
processingData(tmp)
## - - - Processing information - - -
## Subset [689,16][301,16] Tue Jul 10 15:18:33 2018
## Removed features with more than 0 NAs: Tue Jul 10 15:18:33 2018
## Dropped featureData's levels Tue Jul 10 15:18:33 2018
## MSnbase version: 1.15.6
It is of course possible to impute missing values (?impute). This is however not a straightforward thing, as is likely to dramatically fail when a high proportion of data is missing (10s of %)7 Note that when using limma for instance, downstream analyses can handle missing values. Still, it is recommended to explore missingness as part of the exploratory data analysis.. But also, there are two types of mechanisms resulting in missing values in LC/MSMS experiments.
Missing values resulting from absence of detection of a feature, despite ions being present at detectable concentrations. For example in the case of ion suppression or as a result from the stochastic, data-dependent nature of the MS acquisition method. These missing value are expected to be randomly distributed in the data and are defined as missing at random (MAR) or missing completely at random (MCAR).
Biologically relevant missing values, resulting from the absence or the low abundance of ions (below the limit of detection of the instrument). These missing values are not expected to be randomly distributed in the data and are defined as missing not at random (MNAR).
Random and non-random missing values.
Different imputation methods are more appropriate to different classes of missing values (as documented in this paper). Values missing at random, and those missing not at random should be imputed with different methods.
Root-mean-square error (RMSE) observations standard deviation ratio (RSR), KNN and MinDet imputation. Lower (blue) is better. (See here for details)
Generally, it is recommended to use hot deck methods (nearest neighbour (left), maximum likelihood, …) when data are missing at random.Conversely, MNAR features should ideally be imputed with a left-censor (minimum value (right), but not zero, …) method.
## impute missing values using knn imputation
tmp <- impute(naset, method = "knn")
## Warning in knnimp(x, k, maxmiss = rowmax, maxp = maxp): 12 rows with more than 50 % entries missing;
## mean imputation used for these rows
processingData(tmp)
## - - - Processing information - - -
## Data imputation using knn Tue Jul 10 15:18:34 2018
## Using default parameters
## MSnbase version: 1.15.6
There are various methods to perform data imputation, as described in ?impute.
R in general and Bioconductor in particular are well suited for the statistical analysis of data of quantitative proteomics data. Several packages provide dedicated resources for proteomics data:
MSstats: A set of tools for statistical relative protein significance analysis in Data dependent (DDA), SRM and Data independent acquisition (DIA) experiments. Data stored in data.frame or MSnSet objects can be used as input.
msmsTests: Statistical tests for label-free LC-MS/MS data by spectral counts, to discover differentially expressed proteins between two biological conditions. Three tests are available: Poisson GLM regression, quasi-likelihood GLM regression, and the negative binomial of the edgeR package. All can be readily applied on MSnSet instances produced, for example by MSnID.
isobar also provides dedicated infrastructure for the statistical analysis of isobaric data.
The MLInterfaces package provides a unified interface to a wide range of machine learning algorithms. Initially developed for microarray and ExpressionSet instances, the pRoloc package enables application of these algorithms to MSnSet data.
Dimensionality reduction is very frequently used to summarise high-dimensional data. Below we will use principal component analysis (PCA), but other methods can be applied. Below, we will use the plot2D function from the pRoloc package8 While originally developed for the analysis of spatial/organelle proteomics data in mind, it is applicable many use cases., that will extract the expression values in the assay data, perform dimensionality reduction, an produce the scatter plot.
Let’s first use plot2D to visualise the pattern in 20 protein quantitation values (initial 20 dimensional data). Here, we use an example from spatial proteomics, where the quantitative protein profiles reflect the proteins sub-cellular localisation (from Christoforou et al, 2016, see also Breckels et al, 2016 for more data analysis background). We will use the known localisation of some proteins (marker proteins) to annotate the plot (using the fcol argument).
library("pRoloc")
library("pRolocdata")
data(hyperLOPIT2015)
plot2D(hyperLOPIT2015, fcol = "markers")
addLegend(hyperLOPIT2015, fcol = "markers", cex = .7)
PCA plot for protein sub-cellular localisation.
In other cases, we want to visualise the relation of samples. plot2D uses the rows of the data to perform dimensionality reduction. To use the columns, we just need to transpose the MSnSet. By doing so, the pData becomes the fData and vice versa.
Let’s use a time-course experiment on stem cells (Mulvey et al. 2015). Below, we use the times (time points) variable to set colours and rep (replicate numbers) to set the plotting characters.
data(mulvey2015)
head(pData(mulvey2015))
## rep times cond
## rep1_0hr 1 1 1
## rep1_16hr 1 2 1
## rep1_24hr 1 3 1
## rep1_48hr 1 4 1
## rep1_72hr 1 5 1
## rep1_XEN 1 6 1
plot2D(t(mulvey2015), fcol = "times", fpch = "rep", cex = 2)
addLegend(t(mulvey2015), fcol = "times")
## Classification
The example below uses knn with the 5 closest neighbours and the MLInterfaces package as an illustration to classify proteins of unknown sub-cellular localisation to one of 9 possible organelles.
library("MLInterfaces")
library("pRolocdata")
data(dunkley2006)
traininds <- which(fData(dunkley2006)$markers != "unknown")
ans <- MLearn(markers ~ ., data = t(dunkley2006), knnI(k = 5), traininds)
ans
## MLInterfaces classification output container
## The call was:
## MLearn(formula = markers ~ ., data = t(dunkley2006), .method = knnI(k = 5),
## trainInd = traininds)
## Predicted outcome distribution for test set:
##
## ER lumen ER membrane Golgi Mitochondrion Plastid
## 5 140 67 51 29
## PM Ribosome TGN vacuole
## 89 31 6 10
## Summary of scores on test set (use testScores() method for details):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 1.0000 1.0000 0.9332 1.0000 1.0000
A wide range of classification and clustering algorithms are also available, as described in the ?MLearn documentation page, used below.
kcl <- MLearn( ~ ., data = dunkley2006, kmeansI, centers = 12)
kcl
## clusteringOutput: partition table
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 55 28 29 18 60 30 44 145 78 116 33 53
## The call that created this object was:
## MLearn(formula = ~., data = dunkley2006, .method = kmeansI, centers = 12)
plot(kcl, exprs(dunkley2006))
Kmeans clustering using r Biocpkg('MLInterfaces') with an MSnSet object.
All the Bioconductor annotation infrastructure, such as biomaRt, GO.db, organism specific annotations, .. are directly relevant to the analysis of proteomics data. A total of 191 ontologies, including some proteomics-centred annotations such as the PSI Mass Spectrometry Ontology, Molecular Interaction (PSI MI 2.5) or Protein Modifications are available through the rols
library("rols")
res <- OlsSearch(q = "ESI", ontology = "MS", exact = TRUE)
## Error: lexical error: invalid char in json text.
## <!DOCTYPE html PUBLIC "-//W3C//
## (right here) ------^
res
## MSnSet (storageMode: lockedEnvironment)
## assayData: 888 features, 4 samples
## element names: exprs
## protocolData: none
## phenoData: none
## featureData
## featureNames: 1 2 ... 888 (888 total)
## fvarLabels: Protein.ID FBgn ... pd.markers (12 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## MSnbase version: 2.6.1
There is a single exact match (default is to retrieve 20 results), that can be retrieved and coerced to a Terms or data.frame object with
res <- olsSearch(res)
## Error in handle_url(handle, url, ...): no slot of name "url" for this object of class "MSnSet"
as(res, "Terms")
## Error in as(res, "Terms"): no method or default for coercing "MSnSet" to "Terms"
as(res, "data.frame")
## X1 X2 X3 X4 X5 X6 X7 X8 X9
## X10 X11 X12 X13 X14 X15 X16 X17
## X18 X19 X20 X21 X22 X23 X24 X25
## X26 X27 X28 X29 X30 X31 X32 X33 X34
## X35 X36 X37 X38 X39 X40 X41 X42
## X43 X44 X45 X46 X47 X48 X49 X50
## X51 X52 X53 X54 X55 X56 X57 X58 X59
## X60 X61 X62 X63 X64 X65 X66 X67
## X68 X69 X70 X71 X72 X73 X74
## X75 X76 X77 X78 X79 X80 X81 X82
## X83 X84 X85 X86 X87 X88 X89
## X90 X91 X92 X93 X94 X95 X96 X97
## X98 X99 X100 X101 X102 X103 X104 X105 X106
## X107 X108 X109 X110 X111 X112 X113 X114
## X115 X116 X117 X118 X119 X120 X121 X122
## X123 X124 X125 X126 X127 X128 X129 X130
## X131 X132 X133 X134 X135 X136 X137 X138
## X139 X140 X141 X142 X143 X144 X145 X146
## X147 X148 X149 X150 X151 X152 X153 X154
## X155 X156 X157 X158 X159 X160 X161 X162
## X163 X164 X165 X166 X167 X168 X169 X170 X171
## X172 X173 X174 X175 X176 X177 X178 X179 X180
## X181 X182 X183 X184 X185 X186 X187 X188 X189
## X190 X191 X192 X193 X194 X195 X196 X197 X198
## X199 X200 X201 X202 X203 X204 X205 X206 X207
## X208 X209 X210 X211 X212 X213 X214 X215
## X216 X217 X218 X219 X220 X221 X222 X223
## X224 X225 X226 X227 X228 X229 X230 X231 X232
## X233 X234 X235 X236 X237 X238 X239 X240 X241
## X242 X243 X244 X245 X246 X247 X248 X249
## X250 X251 X252 X253 X254 X255 X256 X257 X258
## X259 X260 X261 X262 X263 X264 X265 X266
## X267 X268 X269 X270 X271 X272 X273 X274 X275
## X276 X277 X278 X279 X280 X281 X282 X283 X284
## X285 X286 X287 X288 X289 X290 X291 X292
## X293 X294 X295 X296 X297 X298 X299 X300 X301
## X302 X303 X304 X305 X306 X307 X308 X309
## X310 X311 X312 X313 X314 X315 X316 X317
## X318 X319 X320 X321 X322 X323 X324 X325
## X326 X327 X328 X329 X330 X331 X332
## X333 X334 X335 X336 X337 X338 X339
## X340 X341 X342 X343 X344 X345 X346
## X347 X348 X349 X350 X351 X352 X353 X354
## X355 X356 X357 X358 X359 X360 X361 X362 X363
## X364 X365 X366 X367 X368 X369 X370 X371 X372
## X373 X374 X375 X376 X377 X378 X379 X380
## X381 X382 X383 X384 X385 X386 X387 X388 X389
## X390 X391 X392 X393 X394 X395 X396 X397
## X398 X399 X400 X401 X402 X403 X404 X405
## X406 X407 X408 X409 X410 X411 X412 X413 X414
## X415 X416 X417 X418 X419 X420 X421
## X422 X423 X424 X425 X426 X427 X428 X429
## X430 X431 X432 X433 X434 X435 X436 X437 X438
## X439 X440 X441 X442 X443 X444 X445 X446
## X447 X448 X449 X450 X451 X452 X453 X454
## X455 X456 X457 X458 X459 X460 X461 X462
## X463 X464 X465 X466 X467 X468 X469 X470
## X471 X472 X473 X474 X475 X476 X477 X478
## X479 X480 X481 X482 X483 X484 X485 X486
## X487 X488 X489 X490 X491 X492 X493 X494
## X495 X496 X497 X498 X499 X500 X501 X502 X503
## X504 X505 X506 X507 X508 X509 X510 X511
## X512 X513 X514 X515 X516 X517 X518 X519
## X520 X521 X522 X523 X524 X525 X526 X527
## X528 X529 X530 X531 X532 X533 X534 X535 X536
## X537 X538 X539 X540 X541 X542 X543 X544
## X545 X546 X547 X548 X549 X550 X551 X552
## X553 X554 X555 X556 X557 X558 X559 X560
## X561 X562 X563 X564 X565 X566 X567 X568 X569
## X570 X571 X572 X573 X574 X575 X576 X577
## X578 X579 X580 X581 X582 X583 X584 X585
## X586 X587 X588 X589 X590 X591 X592 X593 X594 X595
## X596 X597 X598 X599 X600 X601 X602 X603
## X604 X605 X606 X607 X608 X609 X610 X611
## X612 X613 X614 X615 X616 X617 X618 X619 X620
## X621 X622 X623 X624 X625 X626 X627 X628 X629
## X630 X631 X632 X633 X634 X635 X636 X637
## X638 X639 X640 X641 X642 X643 X644 X645
## X646 X647 X648 X649 X650 X651 X652 X653 X654
## X655 X656 X657 X658 X659 X660 X661 X662
## X663 X664 X665 X666 X667 X668 X669 X670
## X671 X672 X673 X674 X675 X676 X677 X678 X679
## X680 X681 X682 X683 X684 X685 X686
## X687 X688 X689 X690 X691 X692 X693 X694
## X695 X696 X697 X698 X699 X700 X701 X702
## X703 X704 X705 X706 X707 X708 X709 X710 X711
## X712 X713 X714 X715 X716 X717 X718 X719
## X720 X721 X722 X723 X724 X725 X726 X727 X728
## X729 X730 X731 X732 X733 X734 X735 X736 X737
## X738 X739 X740 X741 X742 X743 X744 X745
## X746 X747 X748 X749 X750 X751 X752 X753
## X754 X755 X756 X757 X758 X759 X760 X761 X762
## X763 X764 X765 X766 X767 X768 X769 X770
## X771 X772 X773 X774 X775 X776 X777 X778
## X779 X780 X781 X782 X783 X784 X785 X786
## X787 X788 X789 X790 X791 X792 X793 X794
## X795 X796 X797 X798 X799 X800 X801 X802
## X803 X804 X805 X806 X807 X808 X809 X810
## X811 X812 X813 X814 X815 X816 X817
## X818 X819 X820 X821 X822 X823 X824 X825 X826
## X827 X828 X829 X830 X831 X832 X833 X834
## X835 X836 X837 X838 X839 X840 X841 X842
## X843 X844 X845 X846 X847 X848 X849 X850
## X851 X852 X853 X854 X855 X856 X857 X858
## X859 X860 X861 X862 X863 X864 X865 X866
## X867 X868 X869 X870 X871 X872 X873 X874
## X875 X876 X877 X878 X879 X880 X881 X882 X883
## X884 X885 X886 X887 X888
## [ reached getOption("max.print") -- omitted 4 rows ]
Data from the Human Protein Atlas is available via the hpar package.
sessionInfo()
## R version 3.5.0 Patched (2018-05-14 r74725)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.0
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] ggplot2_3.0.0 bindrcpp_0.2.2 dplyr_0.7.6
## [4] msdata_0.20.0 AnnotationHub_2.12.0 knitr_1.20
## [7] hpar_1.22.2 rols_2.8.1 MSGFplus_1.14.0
## [10] pRolocdata_1.18.0 rpx_1.16.0 MSnID_1.14.0
## [13] mzID_1.18.0 pRoloc_1.20.1 MLInterfaces_1.60.1
## [16] cluster_2.0.7-1 annotate_1.58.0 XML_3.98-1.11
## [19] AnnotationDbi_1.42.1 IRanges_2.14.10 S4Vectors_0.18.3
## [22] gridExtra_2.3 lattice_0.20-35 RforProteomics_1.18.1
## [25] BiocInstaller_1.30.0 MSnbase_2.6.1 ProtGenerics_1.12.0
## [28] BiocParallel_1.14.2 mzR_2.14.0 Rcpp_0.12.17
## [31] Biobase_2.40.0 BiocGenerics_0.26.0 BiocStyle_2.8.2
##
## loaded via a namespace (and not attached):
## [1] R.utils_2.6.0 RUnit_0.4.32
## [3] tidyselect_0.2.4 RSQLite_2.1.1
## [5] htmlwidgets_1.2 grid_3.5.0
## [7] trimcluster_0.1-2 lpSolve_5.6.13
## [9] rda_1.0.2-2 munsell_0.5.0
## [11] codetools_0.2-15 preprocessCore_1.42.0
## [13] DT_0.4 withr_2.1.2
## [15] colorspace_1.3-2 Category_2.46.0
## [17] highr_0.7 geometry_0.3-6
## [19] robustbase_0.93-1 dimRed_0.1.0
## [21] labeling_0.3 mnormt_1.5-5
## [23] hwriter_1.3.2 bit64_0.9-7
## [25] ggvis_0.4.3 rprojroot_1.3-2
## [27] ipred_0.9-6 randomForest_4.6-14
## [29] diptest_0.75-7 R6_2.2.2
## [31] doParallel_1.0.11 gridSVG_1.6-0
## [33] flexmix_2.3-14 DRR_0.0.3
## [35] bitops_1.0-6 assertthat_0.2.0
## [37] promises_1.0.1 scales_0.5.0
## [39] nnet_7.3-12 gtable_0.2.0
## [41] affy_1.58.0 biocViews_1.48.2
## [43] ddalpha_1.3.4 timeDate_3043.102
## [45] rlang_0.2.1 CVST_0.2-2
## [47] genefilter_1.62.0 RcppRoll_0.3.0
## [49] splines_3.5.0 lazyeval_0.2.1
## [51] ModelMetrics_1.1.0 impute_1.54.0
## [53] hexbin_1.27.2 broom_0.4.5
## [55] yaml_2.1.19 reshape2_1.4.3
## [57] abind_1.4-5 threejs_0.3.1
## [59] crosstalk_1.0.0 backports_1.1.2
## [61] httpuv_1.4.4.2 RBGL_1.56.0
## [63] caret_6.0-80 tools_3.5.0
## [65] lava_1.6.2 psych_1.8.4
## [67] gplots_3.0.1 affyio_1.50.0
## [69] RColorBrewer_1.1-2 proxy_0.4-22
## [71] plyr_1.8.4 base64enc_0.1-3
## [73] progress_1.2.0 zlibbioc_1.26.0
## [75] purrr_0.2.5 RCurl_1.95-4.10
## [77] prettyunits_1.0.2 rpart_4.1-13
## [79] viridis_0.5.1 sampling_2.8
## [81] sfsmisc_1.1-2 magrittr_1.5
## [83] data.table_1.11.4 pcaMethods_1.72.0
## [85] mvtnorm_1.0-8 whisker_0.3-2
## [87] R.cache_0.13.0 hms_0.4.2
## [89] mime_0.5 evaluate_0.10.1
## [91] xtable_1.8-2 mclust_5.4.1
## [93] compiler_3.5.0 biomaRt_2.36.1
## [95] tibble_1.4.2 KernSmooth_2.23-15
## [97] crayon_1.3.4 R.oo_1.22.0
## [99] htmltools_0.3.6 later_0.7.3
## [101] tidyr_0.8.1 lubridate_1.7.4
## [103] DBI_1.0.0 magic_1.5-8
## [105] MASS_7.3-50 fpc_2.1-11
## [107] Matrix_1.2-14 vsn_3.48.1
## [109] R.methodsS3_1.7.1 gdata_2.18.0
## [111] mlbench_2.1-1 bindr_0.1.1
## [113] gower_0.1.2 igraph_1.2.1
## [115] pkgconfig_2.0.1 foreign_0.8-70
## [117] recipes_0.1.3 MALDIquant_1.18
## [119] xml2_1.2.0 foreach_1.4.4
## [121] prodlim_2018.04.18 stringr_1.3.1
## [123] digest_0.6.15 pls_2.6-0
## [125] graph_1.58.0 rmarkdown_1.10
## [127] dendextend_1.8.0 GSEABase_1.42.0
## [129] curl_3.2 kernlab_0.9-26
## [131] shiny_1.1.0 gtools_3.8.1
## [133] modeltools_0.2-21 nlme_3.1-137
## [135] jsonlite_1.5 interactiveDisplay_1.18.0
## [137] viridisLite_0.3.0 limma_3.36.2
## [139] pillar_1.2.3 httr_1.3.1
## [141] DEoptimR_1.0-8 survival_2.42-4
## [143] interactiveDisplayBase_1.18.0 glue_1.2.0
## [145] FNN_1.1 gbm_2.1.3
## [147] prabclus_2.2-6 iterators_1.0.9
## [149] bit_1.1-14 class_7.3-14
## [151] stringi_1.2.3 blob_1.1.1
## [153] caTools_1.17.1 memoise_1.1.0
## [155] e1071_1.6-8